merge master by fengjian428 · Pull Request #4 · fengjian428/hudi

fengjian428 · 2022-03-13T13:27:22Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

…d for Spark SQL (#4901) * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL * [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL Co-authored-by: shibei <huberylee.li@alibaba-inc.com>

…partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables
* Addressing feedback
* temp diff
* fixing tests after spark datasource read support for metadata table is merged to master
* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader
Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>

…4811)
* Making commit preserve metadata to true
* Fixing integ tests
* Fixing preserve commit metadata for metadata table
* fixed bootstrap tests
* temp diff
* Fixing merge handle
* renaming fallback record
* fixing build issue
* Fixing test failures

…y on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS. Previously, after log file append operations, MT records were updated with just the size of the deltas being appended to the original files. This was found to cause issues with rollbacks, which instead updated MT with records bearing the full file size.
- To hedge against similar issues going forward, this PR removes the discrepancy and streamlines the flow so that the MT table always ingests records bearing full file sizes.

…lter construction from index based on the type param (#4848)
Rework of #4761
This diff introduces the following changes:
- Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that record generation scales with need.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, the type param stored in the payload is used instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>

Desc: Add a Hive sync config (hoodie.datasource.hive_sync.sync_comment). This config defaults to false. When column comments are present in the source Avro schema and sync_comment is true, the column comments are synced to the Hive table.
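
As a rough illustration of the flag described above, the sketch below reads the option from a plain java.util.Properties object and falls back to false when it is absent. Only the key name and its default come from the commit message; the class and method names are hypothetical and this is not Hudi's own config class.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: shows the opt-in nature of the flag.
final class HiveSyncCommentConfigSketch {

  // Key name taken from the commit description.
  static final String SYNC_COMMENT_KEY = "hoodie.datasource.hive_sync.sync_comment";

  private HiveSyncCommentConfigSketch() {}

  // Returns true only when the flag is explicitly set to "true";
  // an absent key behaves like the documented default of false.
  static boolean isSyncCommentEnabled(Properties props) {
    return Boolean.parseBoolean(props.getProperty(SYNC_COMMENT_KEY, "false"));
  }
}
```

With an empty Properties object the helper returns false, so column comments would be synced only after the key is set to "true".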

… columns (#4818)
NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there is minimal code duplication among them and that the implementations are consistent and performant.
This PR makes sure that BaseFileOnlyViewRelation only reads projected columns and avoids unnecessary serde from Row to InternalRow.
Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions

manojpec and others added 30 commits March 3, 2022 15:56

[HUDI-2973] RFC-27: Data skipping index to improve query performance (#…

51ee500

…4728) - Updating the schema used for data skipping index

[HUDI-3544] Fixing "populate meta fields" update to metadata table (#…

876a891

…4941)
* Fixing populateMeta fields update to metadata table
* Fix checkstyle violations
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

[HUDI-3552] Strength the NetworkUtils#getHostname by checking network…

a4ba0ff

… interfaces first (#4942) * In some complex network environment, the current code returns wildcard address 0.0.0.0 which is not desired.
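
The idea in this commit message, preferring a concrete interface address over the wildcard 0.0.0.0, could look roughly like the sketch below. It uses only standard java.net calls. The class and method names are hypothetical, and the selection order (first non-loopback, non-wildcard address, then the local host, then loopback) is an assumption, not the logic of Hudi's NetworkUtils#getHostname.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical sketch, not Hudi's NetworkUtils.
final class HostAddressSketch {

  private HostAddressSketch() {}

  // An address is usable here if it is neither the wildcard (0.0.0.0 / ::) nor loopback.
  static boolean isUsable(InetAddress addr) {
    return !addr.isAnyLocalAddress() && !addr.isLoopbackAddress();
  }

  // Checks network interfaces first, then falls back to the local host lookup.
  static String resolveHostAddress() {
    try {
      Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
      if (interfaces != null) {
        for (NetworkInterface ni : Collections.list(interfaces)) {
          if (!ni.isUp() || ni.isLoopback()) {
            continue; // skip interfaces that are down or loopback-only
          }
          for (InetAddress addr : Collections.list(ni.getInetAddresses())) {
            if (isUsable(addr)) {
              return addr.getHostAddress();
            }
          }
        }
      }
    } catch (SocketException e) {
      // Interface enumeration failed; fall through to the next strategy.
    }
    try {
      InetAddress local = InetAddress.getLocalHost();
      if (!local.isAnyLocalAddress()) {
        return local.getHostAddress();
      }
    } catch (UnknownHostException e) {
      // No resolvable local host name; fall through to loopback.
    }
    return InetAddress.getLoopbackAddress().getHostAddress();
  }
}
```

Whatever the environment, this sketch never returns the wildcard address: every candidate is filtered through isAnyLocalAddress() before it is returned.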

[HUDI-3548] Fix if user specify key "hoodie.datasource.clustering.asy…

be9a264

…nc.enable" directly, async clustering not work (#4905) Co-authored-by: Rex An <bonean131@gmail.com>

[HUDI-3161][RFC-47] Add Call Produce Command for Spark SQL (#4607)

6faed3d

[MINOR] fix UTC timezone config (#4950)

f449807

[HUDI-3348] Add UT to verify HoodieRealtimeFileSplit serde (#4951)

b4362fa

[HUDI-3460] Add reader merge memory option for flink (#4911)

0986d5a

* flink TM memory Optimization

[HUDI-2761] Fixing timeline server for repeated refreshes (#4812)

6a46130

* Fixing timeline server for repeated refreshes

[HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support add…

4b47177

…ing _hoodie_is_deleted column to schema (#4921)

[HUDI-3525] Introduce JsonkafkaSourceProcessor to support data prepro…

c9ffdc4

…cess before it is transformed to DataSet (#4930)

[HUDI-3069] Improve HoodieMergedLogRecordScanner avoid putting unnece…

6f57bbf

…ssary hoodie records (#4932)
* log scanner optimization
* payload equals switches to `=`
Co-authored-by: 苏承祥 <sucx@tuya.com>

[HUDI-3561] Avoid including whole MultipleSparkJobExecutionStrategy…

f0bcee3

… object into the closure for Spark to serialize (#4954) - Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize

[HUDI-2747] support set --sparkMaster for MDT cli (#4964)

53826d6

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

[HUDI-3576] Configuring timeline refreshes based on latest commit (#4973)

2904076

[HUDI-3573] flink cleanFuntion execute clean on initialization (#4936)

34bc752

For flink insert overwrite operation, do the cleaning each time before the write.

[MINOR][HUDI-3460]Fix HoodieDataSourceITCase

b6bdb46

close #4959

[HUDI-2677] Add DFS based message queue for flink writer[part3] (#4961)

fe53bd2

[HUDI-3574] Improve maven module configs for different spark profiles (…

2538580

…#4970)

[HUDI-3584] Skip integ test modules by default (#4986)

ed26c52

[HUDI-3221] Support querying a table as of a savepoint (#4720)

08fd80c

[HUDI-3587] Making SupportsUpgradeDowngrade serializable (#4991)

4324e87

[HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting m…

548000b

…ulti processors at once (#4968)

[MINOR] Add IT CI Test timeout option (#5003)

ca0b8fc

Alexey Kudinkin and others added 17 commits March 9, 2022 21:45

[HUDI-3581] Reorganize some clazz for hudi flink (#4983)

ec24407

[HUDI-3602][DOCS] Update docker README to build multi-arch images usi…

4e09545

…ng buildx (#5011)

[HUDI-3586] Add Trino Queries in integration tests (#4988)

fa5e750

[HUDI-3595] Fixing NULL schema provider for empty batch (#5002)

9dc6df5

[HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop c…

83cff3a

…olumns from schema (#4972)
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema
* Fix case sensitivity

[HUDI-2999] [RFC-42] RFC for consistent hashing index (#4326)

18cdad9

* [HUDI-2999] rfc for consistent hashing index
* [HUDI-2999] review: add metadata table & non-dual-write solution (virtual log file) for resizing
Co-authored-by: xiaoyuwei <xiaoyuwei.yw@alibaba-inc.com>

[HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926)

faed699

Co-authored-by: 苏承祥 <sucx@tuya.com>

[HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema…

b001803

… in TestSchemaPostProcessor (#5019)

[HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable (#…

56cb494

…4982)

[HUDI-3513] Make sure Column Stats does not fail in case it fails to …

5d59bf6

…load previous Index Table state (#5015)

[HUDI-3592] Fix NPE of DefaultHoodieRecordPayload if Property is empty (

93277b2

#4999) Co-authored-by: Rex An <bonean131@gmail.com>

[HUDI-3569] Introduce ChainedJsonKafkaSourePostProcessor to support s…

e8918b6

…etting multi processors at once (#4969)

[HUDI-3556] Re-use rollback instant for rolling back of clustering an…

e7bb041

…d compaction if rollback failed mid-way (#4971)

[HUDI-3593] Restore TypedProperties and flush checksum in table config (

eee96e9

#5013)
Create new TypedProperties while performing clustering
Add OrderedProperties and minor refactoring
Add javadoc and remove getters from OrderedProperties

[HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException (#…

e60acc1

…4984) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

[HUDI-3501] Support savepoints command based on Call Produce Command (#…

6c8224c

…5025)

fengjian428 merged commit 06ea24c into fengjian428:master Mar 13, 2022

merge master #4

fengjian428 merged 47 commits into fengjian428:master from apache:master

18 participants
